The goal of this machine learning project is to predict the genre of roughly 50,000 tracks obtained from the Spotify API, via Kaggle, as one of the following:
‘Electronic’, ‘Anime’, ‘Jazz’, ‘Alternative’, ‘Country’, ‘Rap’, ‘Blues’, ‘Rock’, ‘Classical’, or ‘Hip-Hop’.
Spotify is a music streaming platform with 406 million monthly users. Here is their “About Us” page for some more info:
[1]: https://newsroom.spotify.com/company-info/
According to the Oxford Dictionary, a genre is a category of artistic composition characterized by similarities in form, style, or subject matter. With this project, I hope to gain insight into whether genre is inherent to the nature of music, or whether it is a product of human nature and our tendency to look for patterns in the world around us. So, we begin…
Starting Off: Loading Data and Taking a Look
For a definition of the numerous variables discussed throughout this project, reference my attached codebook.
#reading in data file
music<-read.csv("/Users/lailaelgamiel/Desktop/PSTAT131/131FinalProject/data/music_genre.csv")
glimpse(music) #taking a look at the initial dataset
## Rows: 50,005
## Columns: 18
## $ instance_id <dbl> 32894, 46652, 30097, 62177, 24907, 89064, 43760, 3073…
## $ artist_name <chr> "Röyksopp", "Thievery Corporation", "Dillon Francis",…
## $ track_name <chr> "Röyksopp's Night Out", "The Shining Path", "Hurrican…
## $ popularity <dbl> 27, 31, 28, 34, 32, 47, 46, 43, 39, 22, 30, 27, 31, 3…
## $ acousticness <dbl> 4.68e-03, 1.27e-02, 3.06e-03, 2.54e-02, 4.65e-03, 5.2…
## $ danceability <dbl> 0.652, 0.622, 0.620, 0.774, 0.638, 0.755, 0.572, 0.80…
## $ duration_ms <dbl> -1, 218293, 215613, 166875, 222369, 519468, 214408, 4…
## $ energy <dbl> 0.941, 0.890, 0.755, 0.700, 0.587, 0.731, 0.803, 0.70…
## $ instrumentalness <dbl> 7.92e-01, 9.50e-01, 1.18e-02, 2.53e-03, 9.09e-01, 8.5…
## $ key <chr> "A#", "D", "G#", "C#", "F#", "D", "B", "G", "F", "A",…
## $ liveness <dbl> 0.1150, 0.1240, 0.5340, 0.1570, 0.1570, 0.2160, 0.106…
## $ loudness <dbl> -5.201, -7.043, -4.617, -4.498, -6.266, -10.517, -4.2…
## $ mode <chr> "Minor", "Minor", "Major", "Major", "Major", "Minor",…
## $ speechiness <dbl> 0.0748, 0.0300, 0.0345, 0.2390, 0.0413, 0.0412, 0.351…
## $ tempo <chr> "100.889", "115.00200000000001", "127.994", "128.014"…
## $ obtained_date <chr> "4-Apr", "4-Apr", "4-Apr", "4-Apr", "4-Apr", "4-Apr",…
## $ valence <dbl> 0.7590, 0.5310, 0.3330, 0.2700, 0.3230, 0.6140, 0.230…
## $ music_genre <chr> "Electronic", "Electronic", "Electronic", "Electronic…
n_distinct(music$music_genre) #how many genres are there?
## [1] 11
Right off the bat, I notice some things about this dataset. First, there are 11 distinct genres present, but only 10 listed on the Kaggle codebook from which I obtained this data; perhaps there are some null values present. Second, most of the features are numerical, aside from track and artist names, obtained date, key, and mode. Oddly, tempo is recorded as a character variable even though the entries look numerical, so I will have to address this later on. Key and mode are naturally categorical, so they will need to be handled via dummy variables or one-hot encoding later on. Finally, not all of the above variables will be useful in identifying a track's genre, so I will either remove them or replace them with characteristics derived from them.
To take a closer look at the distribution of each numerical variable:
summary(music) # to see distributions of the numerical variables
## instance_id artist_name track_name popularity
## Min. :20002 Length:50005 Length:50005 Min. : 0.00
## 1st Qu.:37974 Class :character Class :character 1st Qu.:34.00
## Median :55914 Mode :character Mode :character Median :45.00
## Mean :55888 Mean :44.22
## 3rd Qu.:73863 3rd Qu.:56.00
## Max. :91759 Max. :99.00
## NA's :5 NA's :5
## acousticness danceability duration_ms energy
## Min. :0.0000 Min. :0.0596 Min. : -1 Min. :0.000792
## 1st Qu.:0.0200 1st Qu.:0.4420 1st Qu.: 174800 1st Qu.:0.433000
## Median :0.1440 Median :0.5680 Median : 219281 Median :0.643000
## Mean :0.3064 Mean :0.5582 Mean : 221253 Mean :0.599755
## 3rd Qu.:0.5520 3rd Qu.:0.6870 3rd Qu.: 268612 3rd Qu.:0.815000
## Max. :0.9960 Max. :0.9860 Max. :4830606 Max. :0.999000
## NA's :5 NA's :5 NA's :5 NA's :5
## instrumentalness key liveness loudness
## Min. :0.000000 Length:50005 Min. :0.00967 Min. :-47.046
## 1st Qu.:0.000000 Class :character 1st Qu.:0.09690 1st Qu.:-10.860
## Median :0.000158 Mode :character Median :0.12600 Median : -7.277
## Mean :0.181601 Mean :0.19390 Mean : -9.134
## 3rd Qu.:0.155000 3rd Qu.:0.24400 3rd Qu.: -5.173
## Max. :0.996000 Max. :1.00000 Max. : 3.744
## NA's :5 NA's :5 NA's :5
## mode speechiness tempo obtained_date
## Length:50005 Min. :0.02230 Length:50005 Length:50005
## Class :character 1st Qu.:0.03610 Class :character Class :character
## Mode :character Median :0.04890 Mode :character Mode :character
## Mean :0.09359
## 3rd Qu.:0.09853
## Max. :0.94200
## NA's :5
## valence music_genre
## Min. :0.0000 Length:50005
## 1st Qu.:0.2570 Class :character
## Median :0.4480 Mode :character
## Mean :0.4563
## 3rd Qu.:0.6480
## Max. :0.9920
## NA's :5
At a glance, I see that most of the variables are recorded on a 0-1 scale, besides popularity, duration, and loudness, which are measured in their own natural units. Instance ID appears to be an index, so there's no real meaning to its distribution.
Onto data cleaning!
n_distinct(music$obtained_date) #unique dates
## [1] 6
I can immediately determine that obtained date won't be helpful in predicting genre, as it has nothing to do with the individual tracks, only with when the data was recorded from Spotify; thus, I will remove it before any of my EDA, so it does not distract from my analysis. Instance ID is just an index, so it will be removed as well:
#remove obtained date, instance id
music <- music %>%
select(-obtained_date, -instance_id)
I will begin by checking for NULL values, or missing data in the dataset:
music[rowSums(is.na(music)) > 0, ] #which rows contain missing values?
I see 5 observations that are filled with null values. Since these are so few, I will just go ahead and remove them:
music <- music %>%
drop_na() #remove null values
count(music) #how many observations left?
head(music) #first 6 observations of the dataset
That leaves us with 50000 observations, or individual tracks, to explore, the first 6 of which are displayed above.
Time to deal with the mysterious character tempo, by simply converting it to a double as it should be:
music$tempo <- as.double(as.character(music$tempo))
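One caveat with this conversion: coercing character data to numeric silently turns any non-numeric entry into NA, so it is worth counting the failures afterward. A self-contained toy sketch (the "?" entries are hypothetical placeholders, not the real data):

```r
# Toy example: non-numeric strings become NA when coerced to double
tempo_chr <- c("100.889", "115.002", "?", "127.994", "?")
tempo_num <- suppressWarnings(as.double(tempo_chr))
sum(is.na(tempo_num)) # 2 entries failed to parse
```

This kind of check will matter when I revisit the tempo predictor later on.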
That is as much data cleaning as I can conduct initially. Perhaps the EDA will reveal more issues with the data that need to be sorted out (spoiler: it does).
I will examine each variable's distribution by genre, as well as compare some variables against each other.
GENRE
music %>%
ggplot(aes(music_genre, fill=music_genre)) + #color by genre
geom_bar() + #bar plot of genre counts
labs(title = "Distribution of Genre" )
I can see that this data is perfectly balanced (i.e., it has an equal number of tracks per genre). Because the classes are balanced, accuracy will be a good metric for judging the goodness of fit of the model later on.
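With 10 perfectly balanced classes, the no-information rate is 1/10, so any accuracy above 10% reflects real signal. A quick self-contained toy check (the labels here are hypothetical, not the real dataset):

```r
# With k balanced classes, always guessing the same class yields 1/k accuracy
genres <- rep(paste0("genre", 1:10), each = 100) # toy labels, 10 balanced classes
baseline <- mean(genres == "genre1")             # accuracy of a constant guess
baseline # 0.1
```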
ARTIST NAME
n_distinct(music$artist_name) #how many unique artists?
## [1] 6863
n_distinct(music$track_name) #how many unique songs?
## [1] 41699
#some artist names are marked as "empty field"; how many?
sum(music$artist_name=="empty_field")
## [1] 2489
music %>%
filter(str_detect(artist_name, "empty_field")) %>%
head()
There are 6863 unique artists present in the data, 41699 unique tracks (meaning there must be some duplicate tracks), and 2489 artist names marked as “empty_field”. I won’t remove these observations, because they might still be useful in helping the model predict the genre of other tracks.
In order to make artist names possibly a bit more useful to the model, I will instead try to use the length of the names (as in how many characters are present in the string) to glean some information about their relation to genre. I will store the lengths in a new column in the dataset:
music$Aname_length = str_length(music$artist_name) #new column of artist name lengths
head(music$Aname_length) #first 6 observations
## [1] 8 20 14 8 11 10
Let's look at the distribution of artist name length by genre.
music %>%
ggplot(aes(x=music_genre, y=Aname_length, fill=music_genre)) + #color by genre
geom_boxplot() + #boxplots
labs(title = "Length of Artist Name by Genre", x="Genre", y="Artist Name Length" )
Generally, classical music artists tend to have longer names, while the rest of the genres are quite similarly distributed. This could be useful.
TRACK NAME
Looking at length of track names and creating a new column in the data to store them:
music$Tname_length = str_length(music$track_name) #new column of track name lengths
head(music$Tname_length) #first 6 observations
## [1] 20 16 9 5 16 5
and once again their distributions by genre:
music %>%
ggplot(aes(x=music_genre, y=Tname_length, fill=music_genre)) + #color by genre
geom_boxplot() + #boxplots
labs(title = "Length of Track Name by Genre", x="Genre", y="Track Name Length" )
Here, there is a much more pronounced difference than in lengths of artist names. Generally, name length of classical tracks is greater than any other genre. I’ll keep this in mind.
POPULARITY
#distribution of popularity by genre
music %>%
ggplot(aes(reorder(music_genre, popularity, sum), y=popularity, fill=music_genre)) +
geom_col() + #barplot
labs(title = "Distribution of Popularity by Genre", x="Genre", y="Popularity (Totaled)")
Rap, rock, and hip-hop are the most popular genres, while anime and classical are least popular, with the rest sitting somewhere in the middle.
ACOUSTICNESS
music %>%
ggplot(aes(x=music_genre, y=acousticness, fill=music_genre)) + #color by genre
geom_boxplot() + #boxplots
labs(title = "Acousticness by Genre", x="Genre", y="Acousticness" )
Classical is once again an outlier relative to the rest of the data, as is jazz (which makes sense, as classical and jazz typically consist almost entirely of acoustic instruments, with very little electronic production). All other genres seem similarly distributed as lower in acousticness. NOTE: the rap and hip-hop distributions track each other closely, which makes sense given how related the genres are. I suspect acousticness is correlated with energy, and I will explore this when I examine the energy predictor later on.
DANCEABILITY
music %>%
ggplot(aes(x=music_genre, y=danceability, fill=music_genre)) + #color by genre
geom_boxplot() + #boxplots
labs(title = "Danceability by Genre", x="Genre", y="Danceability" )
Classical is noticeably the least danceable genre, whereas hip-hop and rap are the most danceable; all other genres are nearly the same.
DURATION
summary(music$duration_ms) #numeric distribution of duration in milliseconds
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1 174800 219281 221253 268612 4830606
There is an issue here. -1 is not a valid measurement of time, so there must be missing values. Let’s take a closer look:
sum(music$duration_ms=="-1")
## [1] 4939
This is a large number of observations with missing or invalid duration values. I will fill them with the median of the duration data, so as not to lose the variable by having to remove it:
music <- music %>%
mutate(duration_ms = ifelse(duration_ms==-1, #fill missing values with median
median(duration_ms, na.rm = T),
duration_ms))
sum(music$duration_ms==-1)
## [1] 0
Now that I have filled in the missing values, let's look at the distribution by genre:
music %>%
ggplot(aes(x=music_genre, y=duration_ms, fill=music_genre)) + #color by genre
geom_boxplot() + #boxplots
labs(title = "Track Duration by Genre", x="Genre", y="Duration in Milliseconds" )
There is an extreme outlier present in the electronic genre, as well as outliers in the classical and blues genres, but since the medians of each variable are similarly distributed, I’ll ignore the outliers. Overall, it seems the duration for classical tracks tends to be slightly longer than for any other genre.
INSTRUMENTALNESS
summary(music$instrumentalness)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.000000 0.000158 0.181601 0.155000 0.996000
The minimum and 1st quartile being 0 and median and mean being very small indicate an issue in the data. A plot should give us a better look:
music %>%
ggplot(aes(instrumentalness)) + #distribution of instrumentalness
geom_histogram(fill="#00A36C") +
labs(title = "Distribution of Instrumentalness", x="Instrumentalness")
sum(music$instrumentalness==0.0)
## [1] 15001
It seems a large portion (15001 of 50000 observations, or 30%) of the instrumentalness observations equal 0. This is indicative of missing values rather than actual data points, and that is too many missing values to deal with by replacing them with the mean or median, so I will drop instrumentalness entirely from the dataset, rather than use it to build my models.
music <- music %>%
select(-instrumentalness)
ENERGY
music %>%
ggplot(aes(x=music_genre, y=energy, fill=music_genre)) + #color by genre
geom_boxplot() + #boxplots
labs(title = "Energy by Genre", x="Genre", y="Energy" )
Classical continues to stand apart from the rest of the genres, here in that it tends to be much less energetic, when compared to other genres.
Energy logically seems to correlate with certain variables: acousticness, liveness, loudness, and tempo. Instead of examining each of these variables individually, I will plot energy against each of them and separate the results by genre:
ACOUSTICNESS
music %>%
ggplot(aes(x=energy, y=acousticness, color=music_genre)) + #color by genre
geom_point(alpha=0.05) + #scatterplot
facet_wrap(~music_genre, scales = "free") + #separate graphs by genre
geom_smooth(se = FALSE, color = "black", size = 1) + #add curved line
theme(legend.position="none") + #remove legend (it was not useful)
labs(title = "Energy vs Acousticness by Genre", x="Energy", y="Acousticness")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
I see a strong negative correlation, meaning the more energetic a song, the less acoustic it is, which is the opposite of what I originally believed.
LIVENESS
Now to compare liveness and genre:
music %>%
ggplot(aes(x=energy, y=liveness, color=music_genre)) + #color by genre
geom_point(alpha=0.05) + #scatterplot
facet_wrap(~music_genre, scales = "free") + #separate graphs by genre
geom_smooth(se = FALSE, color = "black", size = 1) + #add curved line
theme(legend.position="none") + #remove legend
labs(title = "Energy vs Liveness by Genre", x="Energy", y="Liveness")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
There is very little correlation between these two, which surprises me, as liveness seems to suggest something energetic. This leads me to believe that liveness is more a measure of how "live" a track is, i.e., performed in front of an audience rather than simply recorded in a studio. I will make note of this in my codebook.
LOUDNESS
music %>%
ggplot(aes(x=energy, y=loudness, color=music_genre)) + #color by genre
geom_point(alpha=0.05) + #scatterplot
facet_wrap(~music_genre, scales = "free") + #separate graphs by genre
geom_smooth(se = FALSE, color = "black", size = 1) + #add curved line
theme(legend.position="none") + #remove legend
labs(title = "Energy vs Loudness by Genre", x="Energy", y="Loudness")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Here, I see a strong positive correlation. This makes complete sense, as loudness is a measure of sound, and sound, of course, is a form of energy.
TEMPO
I believe there are some missing values for the tempo predictor:
sum(is.na(music$tempo))
## [1] 4980
Let's first replace the missing values of tempo with the median of the tempo data:
music <- music %>%
mutate(tempo = ifelse(is.na(tempo), #fill missing values with median
median(tempo, na.rm = T),
tempo))
sum(is.na(music$tempo))
## [1] 0
Now to plot tempo against energy by genre:
music %>%
ggplot(aes(x=energy, y=tempo, color=music_genre)) + #color by genre
geom_point(alpha=0.05) + #scatterplot
facet_wrap(~music_genre, scales = "free") + #separate graphs by genre
geom_smooth(se = FALSE, color = "black", size = 1) + #add curved line
theme(legend.position="none") + #remove legend
labs(title = "Energy vs Tempo by Genre", x="Energy", y="Tempo")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
There is very little correlation between tempo and energy, so I was wrong in assuming they were correlated.
KEY
What keys of music are present in the dataset?
unique(music$key) #what distinct keys are there in the data
## [1] "A#" "D" "G#" "C#" "F#" "B" "G" "F" "A" "C" "E" "D#"
Because key is categorical, I will opt for barplots separated by genre to examine the distribution:
music %>%
ggplot(aes(x=key, fill=key)) + #color by key
geom_bar() + #barplots
facet_wrap(~music_genre, scales = "free") + #separate by genre for readability
labs(title = "Key Distribution by Genre", x="Key")
Each genre looks to have a distinct spread of keys, which means this should be a helpful variable in building the model.
Because key is a categorical variable, I will have to create dummy variables via One Hot Encoding to use it in my model. One Hot Encoding separates a factor's levels into individual columns, using 1s and 0s to indicate whether or not a track falls under each key.
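As a minimal illustration of the idea with toy data (the project itself uses mltools::one_hot below; model.matrix here is just base R standing in):

```r
# One-hot encode a toy key factor: one 0/1 column per level, no intercept term
keys <- factor(c("A", "C#", "A", "G"))
onehot <- model.matrix(~ keys - 1)
dim(onehot)  # 4 rows, 3 columns (one per key level)
onehot[1, ]  # first track: 1 under keysA, 0 elsewhere
```

Note that exactly one entry per row is 1, which is why the encoded columns carry the same information as the original factor.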
music$key <- as.factor(music$key) #convert from character to factor
music <- one_hot(as.data.table(music)) #one hot encode
MODE
There are only two modes which music can fall under:
unique(music$mode)
## [1] "Minor" "Major"
To put it broadly, songs in a major mode tend to sound happier, while songs in a minor mode tend to sound darker or sadder.
mycolors <- c("#FFBF00", "#00A36C") #colors of plot chosen by my sister
music %>%
group_by(mode, music_genre) %>% #to separate by mode, then genre
count() %>% #how many observations per mode type
ggplot(aes(music_genre, n, fill = mode)) + #color by mode
geom_col(position="dodge") + #side by side
scale_fill_manual(values=mycolors) + #apply chosen colors
labs(title = "Mode Distribution by Genre", x="Genre", y="count")
There seems to be a preference for the major mode across all genres, particularly country.
I will use One Hot Encoding for mode as well:
music$mode <- as.factor(music$mode) #convert from character to factor
music <- one_hot(as.data.table(music)) #one hot encode
SPEECHINESS
music %>%
ggplot(aes(x=music_genre, y=speechiness, fill=music_genre)) + #color by genre
geom_boxplot() + #boxplots
labs(title = "Speechiness by Genre", x="Genre", y="Speechiness" )
Rap and hip-hop stick out as being particularly speechy, which should be helpful for later identification. Classical and country are hardly speechy at all.
VALENCE
With regards to music, valence is a measure of perceived "positivity" within a song. The higher the valence, the more "upbeat" the song sounds, and vice versa.
music %>%
ggplot(aes(x=music_genre, y=valence, fill=music_genre)) + #color by genre
geom_boxplot() + #boxplots
labs(title = "Valence by Genre", x="Genre", y="Valence" )
Classical stands out as having lower valence on average than other genres.
EDA Final Touches
Because I extracted the lengths of both the artist and track names, I will drop the original variables and keep the new ones to maintain independence among predictors.
music <- music %>%
select(-artist_name, -track_name) #drop artist and track name
Overall, I notice that for a lot of the features, genres tend to have very similar distributions to each other, with the exception of one or two genres each time. This tells me it will be hard to build a model that flawlessly distinguishes between genre every time.
Thus, I will focus on building and choosing the best possible model, even if it is not perfectly accurate in nature.
First, I will separate labels (predictions) from features (predictors):
music_features <- music %>%
select(-music_genre) #predictors
music_labels <- music %>%
select(music_genre) #predictions
Next, I will scale the features: I will center each predictor variable around 0 and normalize it to have a standard deviation of 1. This will help keep things even across the board when building and comparing the models. I will separate the One Hot Encoded variables from the data prior to scaling, as scaling and normalizing a set of 1s and 0s does not make intuitive sense (like taking the average of true and false). Then, I will scale the necessary features and reattach the two data frames. Finally, I will convert genre from character to factor so I can actually go about making predictions.
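The scaling described above is just the z-score transform; a self-contained toy check that scale() computes (x - mean(x)) / sd(x):

```r
# Verify scale() centers to mean 0 and divides by the sample standard deviation
x <- c(2, 4, 6, 8)
z <- as.vector(scale(x))
isTRUE(all.equal(z, (x - mean(x)) / sd(x))) # TRUE
```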
#separate numeric features and place in one data frame
to_scale <- music_features %>%
select(popularity, acousticness, danceability, duration_ms, energy, liveness, loudness,
speechiness, tempo, valence, Aname_length, Tname_length)
#separate one hot encoded features and place in second data frame
features_encoded <- music_features %>%
select(-popularity, -acousticness, -danceability, -duration_ms, -energy, -liveness,
-loudness, -speechiness, -tempo, -valence, -Aname_length, -Tname_length)
#scale the chosen numerical features
features_scaled <- to_scale %>%
scale() %>%
as.data.frame()
colMeans(features_scaled) #all equal (basically) 0
## popularity acousticness danceability duration_ms energy
## 1.782835e-16 -3.944527e-17 -1.208429e-16 -6.445851e-17 -2.622440e-17
## liveness loudness speechiness tempo valence
## -1.030032e-16 1.521437e-16 -6.332726e-17 1.176359e-16 9.334596e-17
## Aname_length Tname_length
## -1.799978e-16 -5.442050e-17
#rejoin the 2 data frames into one
features_processed <- as.data.frame(c(features_scaled, features_encoded))
#make genre a factor rather than character type
features_processed$music_genre <- as.factor(music_labels$music_genre)
glimpse(features_processed)
## Rows: 50,000
## Columns: 27
## $ popularity <dbl> -1.10799194, -0.85062494, -1.04365019, -0.65759969, -0.78…
## $ acousticness <dbl> -0.8838774, -0.8603818, -0.8886234, -0.8231755, -0.883965…
## $ danceability <dbl> 0.52487283, 0.35692975, 0.34573354, 1.20784137, 0.4464993…
## $ duration_ms <dbl> -0.222787729, -0.232101866, -0.257366933, -0.716832921, -…
## $ energy <dbl> 1.28986301, 1.09708957, 0.58680694, 0.37891401, -0.048211…
## $ liveness <dbl> -0.48810859, -0.43242829, 2.10411858, -0.22826720, -0.228…
## $ loudness <dbl> 0.63812554, 0.33924462, 0.73288475, 0.75219356, 0.4653198…
## $ speechiness <dbl> -0.185319962, -0.627251443, -0.582861004, 1.434437834, -0…
## $ tempo <dbl> -0.655412965, -0.170024909, 0.276808622, 0.277496481, 0.8…
## $ valence <dbl> 1.2250610, 0.3024276, -0.4988067, -0.7537449, -0.5392731,…
## $ Aname_length <dbl> -0.7959179, 1.6583842, 0.4312332, -0.7959179, -0.1823424,…
## $ Tname_length <dbl> -0.01570766, -0.24836201, -0.65550713, -0.88816149, -0.24…
## $ key_A <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
## $ key_A. <int> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_B <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ key_C <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, …
## $ key_C. <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ key_D <int> 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
## $ key_D. <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_E <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_F <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_F. <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_G <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, …
## $ key_G. <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, …
## $ mode_Major <int> 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, …
## $ mode_Minor <int> 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, …
## $ music_genre <fct> Electronic, Electronic, Electronic, Electronic, Electroni…
Now that I have my final version of the dataset, processed as necessary, I will split the data into a training and test set upon which to analyze and compare my model predictions to.
Here, I split the data into an 80% training set and a 20% test set using stratified sampling. Stratifying on genre ensures each genre is represented proportionally in both sets, which matters because the dataset has long stretches of consecutive entries with the same genre.
set.seed(123) #to ensure replicability when randomly splitting and stratifying data
music_split <- features_processed %>%
initial_split(prop = 0.8, strata = "music_genre") #80/20 split with stratified sampling
music_train <- training(music_split) #80% goes to training set
music_test <- testing(music_split) #20% goes to test set
glimpse(music_train) #taking a look at training set
## Rows: 40,000
## Columns: 27
## $ popularity <dbl> 0.24318479, 0.11450129, -0.07852396, -0.07852396, 0.43621…
## $ acousticness <dbl> 0.15414918, -0.02162863, -0.02748789, -0.86911206, -0.896…
## $ danceability <dbl> -0.757092705, 0.978319157, 0.603246271, 0.524872832, -0.8…
## $ duration_ms <dbl> 1.75240318, -0.19916301, 0.18521483, -0.15630666, -0.1208…
## $ energy <dbl> 0.522549122, -0.588733064, 0.530108864, 0.806039475, 0.88…
## $ liveness <dbl> -0.7040244, -0.5561623, -0.3025076, 0.5945639, 0.1676816,…
## $ loudness <dbl> 0.29283857, 0.31863774, 0.52340849, 0.71341368, 0.9217541…
## $ speechiness <dbl> -0.6085088, 0.2112346, -0.5532674, -0.5187415, -0.5059176…
## $ tempo <dbl> -0.293426903, -1.060665389, 0.170775080, 0.139443080, 2.3…
## $ valence <dbl> -0.72946507, -0.34098785, 0.66257831, 0.41978004, 0.76374…
## $ Aname_length <dbl> 0.6357583, -0.3868676, -0.1823424, -0.5913927, -1.2049683…
## $ Tname_length <dbl> -0.19019842, -0.77183431, -0.13203484, -0.42285278, -0.30…
## $ key_A <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_A. <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_B <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ key_C <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, …
## $ key_C. <int> 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, …
## $ key_D <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, …
## $ key_D. <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_E <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_F <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_F. <int> 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_G <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ key_G. <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
## $ mode_Major <int> 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, …
## $ mode_Minor <int> 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, …
## $ music_genre <fct> Alternative, Alternative, Alternative, Alternative, Alter…
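To make the stratification concrete, here is a hedged base-R toy sketch of an 80/20 stratified split (the project itself relies on rsample::initial_split; the two-genre toy data is hypothetical). Sampling indices within each class keeps the class proportions identical in the training set:

```r
# Toy stratified 80/20 split: sample 80% of the row indices within each class
set.seed(123)
toy <- data.frame(genre = rep(c("Rock", "Jazz"), each = 50), x = rnorm(100))
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$genre),
                           function(i) sample(i, size = 0.8 * length(i))))
table(toy[train_idx, "genre"]) # exactly 40 of each class
```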
While training my models using repeated cross validation, I learned that it was quite a lengthy process on my machine. Thus, I decided to write my models to external .rda files, so that I could later load them back in without rerunning the cross validation, and without forcing knitr to rerun it every time I tried to knit my project (one knit attempt took over 20 hours before I adopted this approach). Quickly grabbing my current working directory so I know where to write the files:
getwd()
## [1] "/Users/lailaelgamiel/Desktop/PSTAT131/131FinalProject"
and now onto actually training and building the models. Exciting!
For my project, because I am working to predict categorical data, I decided to build several classification models, including the following, using repeated cross validation. I did so primarily with the caret package, so that the training data could be folded within the building of the model via the trainControl() function:
Random Forest
Over the many times I had to run this process, it took an average of two hours each time; originally, I attempted to train and tune the model with repeated cross validation, but the runtime was simply too taxing on my machine (over 4 hours, and still going) for me to justify keeping it in. Thus, I trained this model using only 10-fold cross validation with no repeats.
I chose a maximum of 25 for my tuning grid for the mtry parameter because my training set contains 26 predictors.
After training the model and saving it to an external .rda file, I commented the code out entirely, as setting cache=TRUE in the R chunk header was being ignored by knitr for some reason.
#rf.fitControl <- trainControl(method="cv", number=10) #10-fold cross validation
#tunegrid <- expand.grid(.mtry=c(2:25)) #try mtry values from 2 to 25
#rf_music2 <- train(music_genre ~., data=music_train, method="rf",
# tuneGrid=tunegrid, ntree = 100, trControl=rf.fitControl)
#rf_music2
#save(rf_music2, file = "/Users/lailaelgamiel/Desktop/PSTAT131/131FinalProject/rfModel.rda")
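For reference, a hedged sketch (assuming caret is installed) of the two resampling setups discussed: the plain 10-fold CV actually used above, and the repeated-CV variant that proved too slow (3 repeats here is an illustrative choice, not the project's setting):

```r
library(caret)

# Plain 10-fold cross validation (what the random forest above was trained with)
cv10 <- trainControl(method = "cv", number = 10)

# Repeated 10-fold CV: every fold assignment is redrawn on each repeat,
# multiplying the training cost (abandoned in this project due to runtime)
repeated_cv <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
```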
The tuning process selected mtry=7 as optimal, with a training accuracy of about 55%. Now to load the model back in from the external .rda file so I can examine it and use it to predict on the test data.
load("rfModel.rda") #load random forest model from external file
I plotted the performance of the random forest model to see the progression of the tuning process:
ggplot(rf_music2) #performance plot of random forest
It seems performance of the model rapidly increased until mtry=7, then slowly tapered off as the mtry parameter value increased. Interestingly, an mtry value of 11 came in a close second to the optimal value.
Now, to use the tuned random forest model (mtry=7) to predict on the test set. I created a confusion matrix to see the ratio of correct to incorrect predictions:
rf.music <- predict(rf_music2, music_test) #predicting on test set
#building confusion matrix to compare predictions to actual test data
confusionMatrix(reference = music_test$music_genre, data = rf.music, mode='everything')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alternative Anime Blues Classical Country Electronic Hip-Hop Jazz
## Alternative 377 28 52 24 63 74 35 29
## Anime 8 775 76 35 10 49 1 17
## Blues 18 45 487 28 42 73 2 136
## Classical 3 50 11 851 0 5 0 42
## Country 94 18 82 9 550 29 5 54
## Electronic 49 52 58 13 38 589 14 113
## Hip-Hop 95 0 6 0 11 25 399 36
## Jazz 83 22 173 34 78 107 15 520
## Rap 59 0 1 0 12 19 483 13
## Rock 214 10 54 6 196 30 46 40
## Reference
## Prediction Rap Rock
## Alternative 42 161
## Anime 2 5
## Blues 0 15
## Classical 0 2
## Country 5 100
## Electronic 5 1
## Hip-Hop 559 43
## Jazz 3 16
## Rap 291 64
## Rock 93 593
##
## Overall Statistics
##
## Accuracy : 0.5432
## 95% CI : (0.5334, 0.553)
## No Information Rate : 0.1
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4924
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Alternative Class: Anime Class: Blues
## Sensitivity 0.3770 0.7750 0.4870
## Specificity 0.9436 0.9774 0.9601
## Pos Pred Value 0.4260 0.7924 0.5757
## Neg Pred Value 0.9317 0.9751 0.9440
## Precision 0.4260 0.7924 0.5757
## Recall 0.3770 0.7750 0.4870
## F1 0.4000 0.7836 0.5276
## Prevalence 0.1000 0.1000 0.1000
## Detection Rate 0.0377 0.0775 0.0487
## Detection Prevalence 0.0885 0.0978 0.0846
## Balanced Accuracy 0.6603 0.8762 0.7236
## Class: Classical Class: Country Class: Electronic
## Sensitivity 0.8510 0.5500 0.5890
## Specificity 0.9874 0.9560 0.9619
## Pos Pred Value 0.8828 0.5814 0.6320
## Neg Pred Value 0.9835 0.9503 0.9547
## Precision 0.8828 0.5814 0.6320
## Recall 0.8510 0.5500 0.5890
## F1 0.8666 0.5653 0.6097
## Prevalence 0.1000 0.1000 0.1000
## Detection Rate 0.0851 0.0550 0.0589
## Detection Prevalence 0.0964 0.0946 0.0932
## Balanced Accuracy 0.9192 0.7530 0.7754
## Class: Hip-Hop Class: Jazz Class: Rap Class: Rock
## Sensitivity 0.3990 0.5200 0.2910 0.5930
## Specificity 0.9139 0.9410 0.9277 0.9234
## Pos Pred Value 0.3399 0.4948 0.3089 0.4626
## Neg Pred Value 0.9319 0.9464 0.9217 0.9533
## Precision 0.3399 0.4948 0.3089 0.4626
## Recall 0.3990 0.5200 0.2910 0.5930
## F1 0.3671 0.5071 0.2997 0.5197
## Prevalence 0.1000 0.1000 0.1000 0.1000
## Detection Rate 0.0399 0.0520 0.0291 0.0593
## Detection Prevalence 0.1174 0.1051 0.0942 0.1282
## Balanced Accuracy 0.6564 0.7305 0.6093 0.7582
The accuracy of the mtry = 7 random forest model is about 54% on the test data, nearly identical to its accuracy on the training data (55%). This suggests the model did not overfit the training set, which I was slightly worried about because of the 80/20 training/test split I used.
From the confusion matrix, I see that the model often confused hip-hop for rap, and rap for hip-hop, which makes sense given the similarity of the two genres. This makes me suspect my models might perform better if I combined the two, but I wanted to keep the labels as-is for this project, to test the models' ability to make fine genre distinctions.
Rock/alternative and rock/country showed similar overlap in the confusion matrix.
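As a possible next step (a hypothetical sketch, not something I ran for this project), the two labels could be collapsed into one factor level before re-splitting and refitting:

```r
# Hypothetical sketch: merge the Rap and Hip-Hop labels into a single level
# before re-splitting and refitting. Assumes `music$music_genre` is a factor.
music_merged <- music
levels(music_merged$music_genre)[levels(music_merged$music_genre) %in%
                                   c("Rap", "Hip-Hop")] <- "Rap/Hip-Hop"
table(music_merged$music_genre)  # the merged level now holds both genres' tracks
```

Assigning a duplicated value into `levels()` is base R's built-in way of merging factor levels, so no extra packages are needed.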
Now, to take a look at which variables were most important to the model:
varImpRF<-varImp(rf_music2) #order of variable importance to random forest model
ggplot(varImpRF, main="Variable Importance with Random Forest") #plot importance
Popularity was by far the most important feature for predicting genre, followed by loudness, speechiness, and danceability (the latter two, I assume, for distinguishing rap and hip-hop). Key and mode appear least important, but this understates key in particular, since it was split 12 ways when it was One Hot Encoded earlier. I will therefore take mode and liveness to be the true least important variables.
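To compare key fairly against the unsplit predictors, the importances of its 12 dummies could be summed back into one score. This is a hypothetical sketch; the `"key..."` column-name pattern is an assumption about how the encoding named the dummies:

```r
library(caret)

# Hypothetical sketch: re-aggregate the importance of the 12 one-hot `key`
# dummies into a single score. Assumes the dummy columns share a "key" prefix.
imp <- varImp(rf_music2)$importance
imp_df <- data.frame(variable = rownames(imp), importance = imp$Overall)
imp_df$variable[grepl("^key", imp_df$variable)] <- "key"
agg <- aggregate(importance ~ variable, data = imp_df, FUN = sum)
agg[order(-agg$importance), ]  # combined ranking with `key` as one row
```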
Because popularity is so crucial to the determination of genre, I am led to believe that genre is less an inherent characteristic of music than a phenomenon of how humans interact with music and attempt to find patterns in it. More on this later…
Boosted Trees
Next, I decided to fit a boosted tree model to the training set, using 10-fold cross-validation repeated three times (rather than more repeats) to keep my machine from suffering through even longer runtimes. Even so, training and tuning this model took just under 3 hours, so once again I wrote the model to an external file to keep things quick and simple.
#gbmFitControl <- trainControl(## 10-fold CV
# method = "repeatedcv",
# number = 10,
# ## repeated three times
# repeats = 3)
#gbmFit1 <- train(music_genre ~ ., data = music_train,
# method = "gbm",
# trControl = gbmFitControl)
#gbmFit1
#save(gbmFit1, file = "/Users/lailaelgamiel/Desktop/PSTAT131/131FinalProject/boostModel.rda")
The results of the optimal tuned parameters are as follows:
150: Optimal number of trees (number of iterations)
3: Optimal interaction depth (complexity of the tree)
0.1: Optimal learning rate (how fast the algorithm adapts, \(\lambda\))
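With the tuning already done, those three values could be pinned in a one-row grid so a refit skips the grid search entirely. A hypothetical sketch (`n.minobsinnode = 10` is gbm's default and was not tuned above; `gbmFit_fixed` is an illustrative name):

```r
library(caret)

# Hypothetical sketch: pin the tuned values so a refit skips the grid search.
gbmGrid <- expand.grid(n.trees = 150,
                       interaction.depth = 3,
                       shrinkage = 0.1,
                       n.minobsinnode = 10)  # gbm default, not tuned above
# gbmFit_fixed <- train(music_genre ~ ., data = music_train, method = "gbm",
#                       trControl = gbmFitControl, tuneGrid = gbmGrid,
#                       verbose = FALSE)
```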
Now to load the boost model from the external file:
load("boostModel.rda") #loading boost model from external file
Plotting the performance of the boosted model:
ggplot(gbmFit1)
I judged that the accuracy gains would continue to taper off as the number of boosting iterations increased, so I stuck with the chosen tuned value of 150. Now on to predicting on the test set and building another confusion matrix:
boost.music <- predict(gbmFit1, music_test) #predicting on test data
#building confusion matrix to compare predictions to actual test data
confusionMatrix(reference = music_test$music_genre, data = boost.music, mode='everything')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alternative Anime Blues Classical Country Electronic Hip-Hop Jazz
## Alternative 411 30 47 24 72 81 36 44
## Anime 4 752 75 23 8 49 1 13
## Blues 14 62 486 30 32 77 2 130
## Classical 2 55 17 852 0 3 0 39
## Country 101 18 90 13 567 33 9 59
## Electronic 51 58 49 17 36 595 13 97
## Hip-Hop 82 0 5 0 14 24 453 32
## Jazz 73 16 166 35 87 87 16 532
## Rap 58 2 4 0 9 21 408 15
## Rock 204 7 61 6 175 30 62 39
## Reference
## Prediction Rap Rock
## Alternative 48 92
## Anime 2 3
## Blues 0 2
## Classical 0 3
## Country 2 59
## Electronic 3 3
## Hip-Hop 411 42
## Jazz 12 13
## Rap 418 54
## Rock 104 729
##
## Overall Statistics
##
## Accuracy : 0.5795
## 95% CI : (0.5698, 0.5892)
## No Information Rate : 0.1
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5328
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Alternative Class: Anime Class: Blues
## Sensitivity 0.4110 0.7520 0.4860
## Specificity 0.9473 0.9802 0.9612
## Pos Pred Value 0.4644 0.8086 0.5820
## Neg Pred Value 0.9354 0.9727 0.9439
## Precision 0.4644 0.8086 0.5820
## Recall 0.4110 0.7520 0.4860
## F1 0.4361 0.7793 0.5297
## Prevalence 0.1000 0.1000 0.1000
## Detection Rate 0.0411 0.0752 0.0486
## Detection Prevalence 0.0885 0.0930 0.0835
## Balanced Accuracy 0.6792 0.8661 0.7236
## Class: Classical Class: Country Class: Electronic
## Sensitivity 0.8520 0.5670 0.5950
## Specificity 0.9868 0.9573 0.9637
## Pos Pred Value 0.8774 0.5962 0.6453
## Neg Pred Value 0.9836 0.9521 0.9554
## Precision 0.8774 0.5962 0.6453
## Recall 0.8520 0.5670 0.5950
## F1 0.8645 0.5812 0.6191
## Prevalence 0.1000 0.1000 0.1000
## Detection Rate 0.0852 0.0567 0.0595
## Detection Prevalence 0.0971 0.0951 0.0922
## Balanced Accuracy 0.9194 0.7622 0.7793
## Class: Hip-Hop Class: Jazz Class: Rap Class: Rock
## Sensitivity 0.4530 0.5320 0.4180 0.7290
## Specificity 0.9322 0.9439 0.9366 0.9236
## Pos Pred Value 0.4262 0.5130 0.4226 0.5145
## Neg Pred Value 0.9388 0.9478 0.9354 0.9684
## Precision 0.4262 0.5130 0.4226 0.5145
## Recall 0.4530 0.5320 0.4180 0.7290
## F1 0.4392 0.5223 0.4203 0.6032
## Prevalence 0.1000 0.1000 0.1000 0.1000
## Detection Rate 0.0453 0.0532 0.0418 0.0729
## Detection Prevalence 0.1063 0.1037 0.0989 0.1417
## Balanced Accuracy 0.6926 0.7379 0.6773 0.8263
Here we see the boost model is about 58% accurate, a small but real improvement over the previously trained random forest model. Once again, hip-hop and rap were greatly confused for each other, alongside country/rock, jazz/blues, and alternative/rock. Looking at variable importance once more:
varImpBoost<-varImp(gbmFit1) #ordered variable importance
ggplot(varImpBoost, main="Variable Importance with BOOST") #plotting importance
In the boosted model, popularity, loudness, speechiness, and danceability are the most important predictors, just as in the random forest model. Once again, liveness and mode are the least important features, though now they, along with tempo, duration, and energy, matter far less than in the random forest model. A model this selective about its predictors is actually a good sign for its robustness.
k-NN
Training and tuning the k-nearest-neighbors model took only around 40 minutes. Again, I used 10-fold cross-validation repeated three times, and wrote the model to an external file.
#knnFitControl <- trainControl(## determine k for best number of neighbors
# method = "repeatedcv",
# ## 10 folds
# number= 10,
## repeated three times
# repeats = 3)
#knnFit1 <- train(music_genre ~ ., data = music_train,
# method = "knn",
# trControl = knnFitControl)
#knnFit1
#save(knnFit1, file = "/Users/lailaelgamiel/Desktop/PSTAT131/131FinalProject/knnModel.rda")
The only parameter being tuned in this model is, of course, k, the number of neighbors. Loading it in:
load("knnModel.rda")
and plotting the performance:
ggplot(knnFit1)
The accuracy looks like it would have continued to increase as the number of neighbors grew, and I'm really not sure why that is. I chose to stick with the tuned value chosen by the model to avoid overcomplicating things.
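One way to investigate would be to widen the search beyond caret's small default grid for knn (k = 5, 7, 9 at the default tune length). A hypothetical sketch, not run for this project (`knnFit_wide` is an illustrative name):

```r
library(caret)

# Hypothetical sketch: widen the k grid to check whether accuracy really
# keeps climbing with more neighbors. Odd k values avoid voting ties.
knnGrid <- data.frame(k = seq(5, 51, by = 2))
# knnFit_wide <- train(music_genre ~ ., data = music_train, method = "knn",
#                      trControl = knnFitControl, tuneGrid = knnGrid)
```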
Once again, onto prediction and confusion matrix.
knn.music <- predict(knnFit1, music_test)
confusionMatrix(reference = music_test$music_genre, data = knn.music, mode="everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alternative Anime Blues Classical Country Electronic Hip-Hop Jazz
## Alternative 299 35 64 15 97 102 62 70
## Anime 8 669 93 37 9 69 0 22
## Blues 35 86 403 23 52 60 1 152
## Classical 1 60 15 847 3 8 0 54
## Country 166 54 157 27 575 74 30 110
## Electronic 65 58 47 14 23 492 24 79
## Hip-Hop 110 2 9 1 23 50 406 43
## Jazz 52 28 160 34 58 92 12 427
## Rap 76 1 2 0 24 24 404 22
## Rock 188 7 50 2 136 29 61 21
## Reference
## Prediction Rap Rock
## Alternative 62 168
## Anime 0 4
## Blues 3 21
## Classical 0 6
## Country 26 170
## Electronic 13 20
## Hip-Hop 445 46
## Jazz 14 26
## Rap 325 49
## Rock 112 490
##
## Overall Statistics
##
## Accuracy : 0.4933
## 95% CI : (0.4835, 0.5031)
## No Information Rate : 0.1
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.437
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Alternative Class: Anime Class: Blues
## Sensitivity 0.2990 0.6690 0.4030
## Specificity 0.9250 0.9731 0.9519
## Pos Pred Value 0.3070 0.7344 0.4821
## Neg Pred Value 0.9223 0.9636 0.9349
## Precision 0.3070 0.7344 0.4821
## Recall 0.2990 0.6690 0.4030
## F1 0.3029 0.7002 0.4390
## Prevalence 0.1000 0.1000 0.1000
## Detection Rate 0.0299 0.0669 0.0403
## Detection Prevalence 0.0974 0.0911 0.0836
## Balanced Accuracy 0.6120 0.8211 0.6774
## Class: Classical Class: Country Class: Electronic
## Sensitivity 0.8470 0.5750 0.4920
## Specificity 0.9837 0.9096 0.9619
## Pos Pred Value 0.8521 0.4140 0.5892
## Neg Pred Value 0.9830 0.9506 0.9446
## Precision 0.8521 0.4140 0.5892
## Recall 0.8470 0.5750 0.4920
## F1 0.8495 0.4814 0.5362
## Prevalence 0.1000 0.1000 0.1000
## Detection Rate 0.0847 0.0575 0.0492
## Detection Prevalence 0.0994 0.1389 0.0835
## Balanced Accuracy 0.9153 0.7423 0.7269
## Class: Hip-Hop Class: Jazz Class: Rap Class: Rock
## Sensitivity 0.4060 0.4270 0.3250 0.4900
## Specificity 0.9190 0.9471 0.9331 0.9327
## Pos Pred Value 0.3577 0.4729 0.3506 0.4471
## Neg Pred Value 0.9330 0.9370 0.9256 0.9427
## Precision 0.3577 0.4729 0.3506 0.4471
## Recall 0.4060 0.4270 0.3250 0.4900
## F1 0.3803 0.4488 0.3373 0.4676
## Prevalence 0.1000 0.1000 0.1000 0.1000
## Detection Rate 0.0406 0.0427 0.0325 0.0490
## Detection Prevalence 0.1135 0.0903 0.0927 0.1096
## Balanced Accuracy 0.6625 0.6871 0.6291 0.7113
The accuracy of the k-NN model with k = 9 comes in at 49%. Like all my previous models, this one really seemed to confuse hip-hop for rap and vice versa. In fact, it confused more genre pairs for one another than any of my other models, including hip-hop/alternative, jazz/country, and country/blues. Overall, it was most sensitive to classical music, which makes sense, as classical was consistently differentiated from the other genres in the EDA.
varImpKNN<-varImp(knnFit1) #order of variable importance
ggplot(varImpKNN, main="Variable Importance with k-NN") #plotting variable importance
A bit hard to see, but popularity is definitely the most important feature across genres, except for rock, which benefits slightly more from loudness. Interestingly, loudness is also important for identifying anime and electronic tracks, while mattering far less for genres like hip-hop, rap, jazz, blues, classical, and country.
SVM
I chose not to conduct PCA and simply trained the SVM model on the data as is: my data mixed continuous and categorical (key and mode) predictors, and after researching and seeing conflicting opinions, I learned that PCA is typically not useful when the data contains One Hot Encoded variables (all 0s and 1s), as they can skew the weights of the principal components.
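Had PCA been attempted anyway, restricting it to the continuous predictors would sidestep that problem. A hypothetical sketch; the column names below are assumptions based on the codebook, not verified against the cleaned dataset:

```r
# Hypothetical sketch: PCA on the continuous predictors only, leaving the
# one-hot key/mode columns out so the 0/1 dummies don't distort the loadings.
num_cols <- c("popularity", "acousticness", "danceability", "duration_ms",
              "energy", "instrumentalness", "liveness", "loudness",
              "speechiness", "tempo", "valence")  # assumed names from codebook
pca <- prcomp(music_train[, num_cols], center = TRUE, scale. = TRUE)
summary(pca)  # proportion of variance explained by each component
```

Centering and scaling matter here because the features live on very different scales (e.g. duration in milliseconds versus danceability in [0, 1]).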
Training and tuning this model took a bit over an hour, so I wrote it to an external file as before:
#svmFitControl <- trainControl(## 10-fold CV
# method = "repeatedcv",
# number = 10,
# ## repeated three times
# repeats = 3)
#svmFit1 <- train(music_genre ~ ., data = music_train,
# method = 'svmLinear', #chose a linear kernel for classification
# trControl = svmFitControl)
#svmFit1
#save(svmFit1, file = "/Users/lailaelgamiel/Desktop/PSTAT131/131FinalProject/svmModel.rda")
There was only one parameter to tune here: the cost C, since the kernel was linear.
Loading in the final model:
load("svmModel.rda")
svm.music <- predict(svmFit1, music_test)
confusionMatrix(reference = music_test$music_genre, data = svm.music, mode='everything')
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alternative Anime Blues Classical Country Electronic Hip-Hop Jazz
## Alternative 317 29 32 15 69 70 74 36
## Anime 5 679 160 57 19 83 1 41
## Blues 17 77 406 25 76 66 1 130
## Classical 5 82 11 824 1 8 0 54
## Country 168 24 86 13 490 41 17 78
## Electronic 97 79 66 30 56 574 25 104
## Hip-Hop 94 1 1 0 14 40 517 35
## Jazz 80 23 187 29 92 79 17 482
## Rap 47 1 1 0 5 18 286 6
## Rock 170 5 50 7 178 21 62 34
## Reference
## Prediction Rap Rock
## Alternative 77 115
## Anime 2 3
## Blues 0 3
## Classical 0 2
## Country 22 63
## Electronic 10 10
## Hip-Hop 424 26
## Jazz 13 24
## Rap 344 65
## Rock 108 689
##
## Overall Statistics
##
## Accuracy : 0.5322
## 95% CI : (0.5224, 0.542)
## No Information Rate : 0.1
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4802
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Alternative Class: Anime Class: Blues
## Sensitivity 0.3170 0.6790 0.4060
## Specificity 0.9426 0.9588 0.9561
## Pos Pred Value 0.3801 0.6467 0.5069
## Neg Pred Value 0.9255 0.9641 0.9354
## Precision 0.3801 0.6467 0.5069
## Recall 0.3170 0.6790 0.4060
## F1 0.3457 0.6624 0.4509
## Prevalence 0.1000 0.1000 0.1000
## Detection Rate 0.0317 0.0679 0.0406
## Detection Prevalence 0.0834 0.1050 0.0801
## Balanced Accuracy 0.6298 0.8189 0.6811
## Class: Classical Class: Country Class: Electronic
## Sensitivity 0.8240 0.4900 0.5740
## Specificity 0.9819 0.9431 0.9470
## Pos Pred Value 0.8349 0.4890 0.5461
## Neg Pred Value 0.9805 0.9433 0.9524
## Precision 0.8349 0.4890 0.5461
## Recall 0.8240 0.4900 0.5740
## F1 0.8294 0.4895 0.5597
## Prevalence 0.1000 0.1000 0.1000
## Detection Rate 0.0824 0.0490 0.0574
## Detection Prevalence 0.0987 0.1002 0.1051
## Balanced Accuracy 0.9029 0.7166 0.7605
## Class: Hip-Hop Class: Jazz Class: Rap Class: Rock
## Sensitivity 0.5170 0.4820 0.3440 0.6890
## Specificity 0.9294 0.9396 0.9523 0.9294
## Pos Pred Value 0.4488 0.4698 0.4450 0.5204
## Neg Pred Value 0.9454 0.9423 0.9289 0.9642
## Precision 0.4488 0.4698 0.4450 0.5204
## Recall 0.5170 0.4820 0.3440 0.6890
## F1 0.4805 0.4758 0.3880 0.5929
## Prevalence 0.1000 0.1000 0.1000 0.1000
## Detection Rate 0.0517 0.0482 0.0344 0.0689
## Detection Prevalence 0.1152 0.1026 0.0773 0.1324
## Balanced Accuracy 0.7232 0.7108 0.6482 0.8092
The SVM model has an accuracy of about 53%, almost matching the random forest model. Let us check which variables were most important for the SVM:
varImpSVM<-varImp(svmFit1)
ggplot(varImpSVM, main="Variable Importance with Support Vector Machines (SVM)")
Similarly to k-NN, rock favored loudness slightly, but popularity beat out all other predictors in importance across genres.
Now to compare each model’s performance against the others’:
# Compare model performances using resamples()
# The random forest model is excluded: it was not trained with repeated CV,
# so its number of resamples differs from these three models.
models_compare <- resamples(list(BOOST=gbmFit1, KNN=knnFit1, SVM=svmFit1))
# Summary of the models performances
summary(models_compare)
##
## Call:
## summary.resamples(object = models_compare)
##
## Models: BOOST, KNN, SVM
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## BOOST 0.56050 0.5736250 0.578125 0.5783417 0.5830625 0.59075 0
## KNN 0.46675 0.4842500 0.489375 0.4874000 0.4921875 0.49775 0
## SVM 0.51775 0.5313125 0.535250 0.5353917 0.5392500 0.55125 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## BOOST 0.5116667 0.5262500 0.5312500 0.5314907 0.5367361 0.5452778 0
## KNN 0.4075000 0.4269444 0.4326389 0.4304444 0.4357639 0.4419444 0
## SVM 0.4641667 0.4792361 0.4836111 0.4837685 0.4880556 0.5013889 0
Plotting the above comparisons for visual ease:
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(models_compare, scales=scales)
Thus, the boosted model is the most accurate of the three models plotted here, and comparing its test accuracy (~58%) to the random forest model’s (~54%), the boosted model still comes out on top overall.
CHOSEN MODEL BASED ON ACCURACY: BOOST
final_model <- gbmFit1 #chosen model is boosted model
Checking a few predictions
predict(final_model, newdata = head(music_test)) #predicted values
## [1] Electronic Blues Electronic Electronic Anime Alternative
## 10 Levels: Alternative Anime Blues Classical Country Electronic ... Rock
head(music_test$music_genre) #actual test values
## [1] Electronic Electronic Electronic Electronic Electronic Electronic
## 10 Levels: Alternative Anime Blues Classical Country Electronic ... Rock
It seems even the best model correctly predicted only 3 of the first 6 entries of the test set (~50% accuracy).
predict(final_model, newdata = tail(music_test))
## [1] Hip-Hop Hip-Hop Rap Rap Rap Hip-Hop
## 10 Levels: Alternative Anime Blues Classical Country Electronic ... Rock
tail(music_test$music_genre)
## [1] Hip-Hop Hip-Hop Hip-Hop Hip-Hop Hip-Hop Hip-Hop
## 10 Levels: Alternative Anime Blues Classical Country Electronic ... Rock
Of the last 6 entries of the test set, the model once again correctly predicted the genre 3 out of 6 times.
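These spot checks generalize: overall test accuracy is just the mean agreement between predicted and actual labels, which gives a quick one-line cross-check of the confusion-matrix accuracies above.

```r
# Overall test accuracy of the chosen model, computed directly:
mean(predict(final_model, newdata = music_test) == music_test$music_genre)

# The same check across all four fitted models on the same test set:
sapply(list(RF = rf_music2, BOOST = gbmFit1, KNN = knnFit1, SVM = svmFit1),
       function(m) mean(predict(m, music_test) == music_test$music_genre))
```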
Overall, of the four models I built and trained on the final cleaned and transformed dataset, the boosted tree model performed best on test-set accuracy, but not by much. Regardless of which model I trained and tested, accuracy never crossed the 60% threshold. This led me to a few different conclusions.
First, I do not believe any of my models were particularly effective because quite a few of the genres in the dataset shared too many similarities to be properly separated. In particular, rap and hip-hop, rock and country, and jazz and blues were consistently lumped together. This is not surprising, as I personally would have a difficult time distinguishing some of these genres depending on the track. I believe these models would have differentiated genres much better if certain combinations were merged, which is a natural next step if I were to continue this analysis. Alternatively, I could build a model focused solely on being very good at identifying whether a track falls under one specific genre, like classical.
This leads me to my second conclusion. Regardless of which model I looked at, classical was always significantly more distinguishable than the other genres. The boosted model was also fairly sensitive to anime and rock, which had generally been indistinguishable when I analyzed their spreads across genres in my EDA. This was one of the more surprising findings of my analysis.
Finally, I conclude that while certain inherent characteristics of music do point toward a genre, the finer-grained a genre’s definition becomes, the harder it is to categorize music by genre alone. You could theoretically invent thousands of subgenres beneath any single genre, as record companies often do, but at that point genre would no longer be a useful classification tool. While this model was interesting to examine, and showed me just how complex and varied the nature of music is, it was not especially useful for classifying songs. Music as a whole is much more enjoyable when we don’t give too much thought to genre, anyway.
Thanks for a great quarter!